feat: add hierarchical queues for capacity plugin #3743

Open: wants to merge 6 commits into master

Rui-Gan
Contributor

@Rui-Gan Rui-Gan commented Sep 23, 2024

No description provided.

@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 23, 2024
@hwdef
Member

hwdef commented Sep 23, 2024

ref: #3590

@hwdef
Member

hwdef commented Sep 23, 2024

/assign @Monokaix @lowang-bh @shinytang6
Please take a look 👀

@Monokaix
Member

Is there validation somewhere that ensures jobs can only be submitted to leaf queues?

@hwdef
Member

hwdef commented Sep 26, 2024

Please add some E2E tests for this feature, since it is very important.

@Rui-Gan
Contributor Author

Rui-Gan commented Sep 29, 2024

Please add some E2E tests for this feature, since it is very important.

I will add it later

@Rui-Gan
Contributor Author

Rui-Gan commented Sep 29, 2024

Is there validation somewhere that ensures jobs can only be submitted to leaf queues?

I have already added this logic in the validate webhook of the job to ensure that jobs can only be submitted to leaf queues.

@googs1025
Member

/cc

@hwdef
Member

hwdef commented Oct 3, 2024

Please consider this scenario:

  1. Create a cluster with the default config (./hack/local-up-volcano.sh).
  2. This configuration uses the proportion plugin by default and automatically creates a default queue.
  3. Switch to the capacity plugin and enable hierarchical queues.

I believe this is how most existing users would start using hierarchical queues.

Result:

  1. The root queue is not created.
  2. The scheduler panics:
2024/10/03 14:13:01 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
I1003 14:13:01.520160       1 flags.go:57] FLAG: --add-dir-header="false"
I1003 14:13:01.520174       1 flags.go:57] FLAG: --alsologtostderr="false"
I1003 14:13:01.520176       1 flags.go:57] FLAG: --ca-cert-file=""
I1003 14:13:01.520179       1 flags.go:57] FLAG: --cache-dump-dir="/tmp"
I1003 14:13:01.520181       1 flags.go:57] FLAG: --cache-dumper="true"
I1003 14:13:01.520183       1 flags.go:57] FLAG: --csi-storage="false"
I1003 14:13:01.520185       1 flags.go:57] FLAG: --default-queue="default"
I1003 14:13:01.520187       1 flags.go:57] FLAG: --enable-healthz="true"
I1003 14:13:01.520189       1 flags.go:57] FLAG: --enable-metrics="true"
I1003 14:13:01.520192       1 flags.go:57] FLAG: --feature-gates=""
I1003 14:13:01.520196       1 flags.go:57] FLAG: --healthz-address=":11251"
I1003 14:13:01.520199       1 flags.go:57] FLAG: --ignored-provisioners="[]"
I1003 14:13:01.520210       1 flags.go:57] FLAG: --kube-api-burst="2000"
I1003 14:13:01.520214       1 flags.go:57] FLAG: --kube-api-qps="2000"
I1003 14:13:01.520218       1 flags.go:57] FLAG: --kubeconfig=""
I1003 14:13:01.520221       1 flags.go:57] FLAG: --leader-elect="false"
I1003 14:13:01.520223       1 flags.go:57] FLAG: --leader-elect-lease-duration="15s"
I1003 14:13:01.520229       1 flags.go:57] FLAG: --leader-elect-renew-deadline="10s"
I1003 14:13:01.520232       1 flags.go:57] FLAG: --leader-elect-resource-lock="leases"
I1003 14:13:01.520234       1 flags.go:57] FLAG: --leader-elect-resource-name="volcano"
I1003 14:13:01.520237       1 flags.go:57] FLAG: --leader-elect-resource-namespace="volcano-system"
I1003 14:13:01.520239       1 flags.go:57] FLAG: --leader-elect-retry-period="2s"
I1003 14:13:01.520242       1 flags.go:57] FLAG: --listen-address=":8080"
I1003 14:13:01.520244       1 flags.go:57] FLAG: --lock-object-namespace="volcano-system"
I1003 14:13:01.520247       1 flags.go:57] FLAG: --log-backtrace-at=":0"
I1003 14:13:01.520251       1 flags.go:57] FLAG: --log-dir=""
I1003 14:13:01.520255       1 flags.go:57] FLAG: --log-file=""
I1003 14:13:01.520259       1 flags.go:57] FLAG: --log-file-max-size="1800"
I1003 14:13:01.520263       1 flags.go:57] FLAG: --log-flush-frequency="5s"
I1003 14:13:01.520267       1 flags.go:57] FLAG: --logtostderr="true"
I1003 14:13:01.520270       1 flags.go:57] FLAG: --master=""
I1003 14:13:01.520274       1 flags.go:57] FLAG: --minimum-feasible-nodes="100"
I1003 14:13:01.520284       1 flags.go:57] FLAG: --minimum-percentage-nodes-to-find="5"
I1003 14:13:01.520287       1 flags.go:57] FLAG: --node-selector="[]"
I1003 14:13:01.520295       1 flags.go:57] FLAG: --node-worker-threads="20"
I1003 14:13:01.520303       1 flags.go:57] FLAG: --one-output="false"
I1003 14:13:01.520306       1 flags.go:57] FLAG: --percentage-nodes-to-find="0"
I1003 14:13:01.520309       1 flags.go:57] FLAG: --plugins-dir=""
I1003 14:13:01.520312       1 flags.go:57] FLAG: --priority-class="true"
I1003 14:13:01.520315       1 flags.go:57] FLAG: --schedule-period="1s"
I1003 14:13:01.520319       1 flags.go:57] FLAG: --scheduler-conf="/volcano.scheduler/volcano-scheduler.conf"
I1003 14:13:01.520322       1 flags.go:57] FLAG: --scheduler-name="[volcano]"
I1003 14:13:01.520328       1 flags.go:57] FLAG: --skip-headers="false"
I1003 14:13:01.520331       1 flags.go:57] FLAG: --skip-log-headers="false"
I1003 14:13:01.520333       1 flags.go:57] FLAG: --stderrthreshold="2"
I1003 14:13:01.520338       1 flags.go:57] FLAG: --tls-cert-file=""
I1003 14:13:01.520340       1 flags.go:57] FLAG: --tls-private-key-file=""
I1003 14:13:01.520343       1 flags.go:57] FLAG: --v="3"
I1003 14:13:01.520348       1 flags.go:57] FLAG: --version="false"
I1003 14:13:01.520350       1 flags.go:57] FLAG: --vmodule=""
W1003 14:13:01.520367       1 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.


panic: failed init default queue, with err: admission webhook "mutatequeue.volcano.sh" denied the request: failed to get parent queue of open queue default: queues.scheduling.volcano.sh "root" not found

goroutine 1 [running]:
volcano.sh/volcano/pkg/scheduler/cache.newDefaultQueue({0x25de198, 0xc0000e3620}, {0x22a7c80, 0x7})
        /go/src/volcano.sh/volcano/pkg/scheduler/cache/cache.go:513 +0x1d3
volcano.sh/volcano/pkg/scheduler/cache.newSchedulerCache(0xc00017d688, {0xc00005aad0, 0x1, 0x1}, {0x22a7c80, 0x7}, {0x0, 0x0, 0x0}, 0x14, ...)
        /go/src/volcano.sh/volcano/pkg/scheduler/cache/cache.go:532 +0xfd
volcano.sh/volcano/pkg/scheduler/cache.New(...)
        /go/src/volcano.sh/volcano/pkg/scheduler/cache/cache.go:92
volcano.sh/volcano/pkg/scheduler.NewScheduler(0xc00017d688?, 0xc0000ec000)
        /go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:70 +0xeb
volcano.sh/volcano/cmd/scheduler/app.Run(0xc0000ec000)
        /go/src/volcano.sh/volcano/cmd/scheduler/app/server.go:71 +0x1a5
main.main()
        /go/src/volcano.sh/volcano/cmd/scheduler/main.go:86 +0x325

Comment on lines 110 to 116
	hierarchyEnabled := cp.HierarchyEnabled(ssn)
	readyToschedule := true
	if hierarchyEnabled {
		readyToschedule = cp.buildHierarchicalQueueAttrs(ssn)
	} else {
		cp.buildQueueAttrs(ssn)
	}
Member

Will placing this logic in OnSessionOpen have any impact on performance?
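
As a rough way to reason about that cost, rebuilding a queue tree and rolling leaf usage up to the ancestors can be done in time linear in the number of queues. The sketch below is an illustration only, not the body of `buildHierarchicalQueueAttrs`: `QueueSpec`, `rollUp`, and the single `Allocated` integer are hypothetical simplifications, and queue names are assumed to be unique. It returns `false` when some queue cannot be reached from the root, which is one plausible reason a session might not be ready to schedule.

```go
package main

import "fmt"

// QueueSpec is a hypothetical, flattened view of a queue for this sketch.
type QueueSpec struct {
	Name      string
	Parent    string
	Allocated int64 // resources allocated directly in this queue
}

// rollUp returns, per queue, its own allocation plus that of all descendants.
// The bool is false if the root is missing or a queue is unreachable from it.
func rollUp(queues []QueueSpec, root string) (map[string]int64, bool) {
	total := make(map[string]int64, len(queues))
	parentOf := make(map[string]string, len(queues))
	children := make(map[string][]string, len(queues))
	for _, q := range queues {
		total[q.Name] = q.Allocated
		if q.Name != root {
			parentOf[q.Name] = q.Parent
			children[q.Parent] = append(children[q.Parent], q.Name)
		}
	}
	if _, ok := total[root]; !ok {
		return nil, false
	}
	// Breadth-first order from the root: every parent precedes its children.
	order := []string{root}
	for i := 0; i < len(order); i++ {
		order = append(order, children[order[i]]...)
		if len(order) > len(queues) {
			return nil, false // guards against malformed input
		}
	}
	if len(order) != len(queues) {
		return nil, false // some queue is not attached to the tree
	}
	// Walk in reverse so each child's total is final before it is added
	// to its parent.
	for i := len(order) - 1; i > 0; i-- {
		name := order[i]
		total[parentOf[name]] += total[name]
	}
	return total, true
}

func main() {
	queues := []QueueSpec{
		{Name: "root"},
		{Name: "a", Parent: "root"},
		{Name: "b", Parent: "root", Allocated: 5},
		{Name: "a1", Parent: "a", Allocated: 2},
		{Name: "a2", Parent: "a", Allocated: 3},
	}
	totals, ok := rollUp(queues, "root")
	fmt.Println(ok, totals["root"], totals["a"])
}
```

Each queue is visited a constant number of times here, so the per-session cost in this model grows with the queue count rather than with tree depth multiplied by queue count.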

@hwdef
Member

hwdef commented Oct 8, 2024

I got an error when creating a vcjob:

Failed to create root queue, error: queues.scheduling.volcano.sh "root" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

@hwdef
Member

hwdef commented Oct 8, 2024

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Oct 8, 2024
pkg/scheduler/api/types.go (outdated review thread, resolved)
@TaiPark
Contributor

TaiPark commented Oct 15, 2024

When scheduling directly with a PodGroup (without a Volcano Job), the check that rejects submissions to a parent queue is missing.

Maybe we need a webhooks.admission.podgroups.validate to check this?

@Rui-Gan
Contributor Author

Rui-Gan commented Oct 15, 2024

When scheduling directly with a PodGroup (without a Volcano Job), the check that rejects submissions to a parent queue is missing.

Maybe we need a webhooks.admission.podgroups.validate to check this?

Thank you for your suggestion. I will add the missing logic.

Signed-off-by: Rui-Gan <ganrui.cs@gmail.com>
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign lowang-bh
You can assign the PR to them by writing /assign @lowang-bh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Rui-Gan <ganrui.cs@gmail.com>
@Rui-Gan Rui-Gan force-pushed the gr/queue branch 2 times, most recently from 005ab18 to f69ac4c Compare October 16, 2024 12:51
Signed-off-by: Rui-Gan <ganrui.cs@gmail.com>